Add perf profiling infrastructure, switch HXB to uncompressed storage, replace Domainslib with WorkerPool, and document compiler hotspots #77
…-times data Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
@copilot This is usually fine in OCaml due to its generational GC, so you might be jumping to conclusions too quickly here. But memtrace sounds like a good plan, go ahead and do that!
Indeed, please identify the places where this occurs.
Agreed, let's store without compression.
I'm not sure what you mean here, please implement it so I can take a look!
I don't have a good solution for this because the Domainslib API only allows fixed domain pools (I think). Can you think of a better approach for managing domains? Also, please be aware that your opam installation got stuck waiting for a y input, so make sure |
…ith perf/olly/memtrace findings Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Addressed in 930c728: GC / memtrace:
Full list in HXB compression: Changed to
Domainslib: The Also noted re: |
@copilot
…y inspection; propose Domain.spawn framework Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Addressed in 7ed90ce and c8e7434.
The sites I previously listed that are NOT polymorphic (corrected):
Still truly polymorphic (calls
Full corrected analysis with assembly evidence in Domain.spawn framework: Added a concrete proposal in
@copilot I'm very curious about that parallel redesign. Go ahead and implement it, and make sure to measure the difference accurately. If this leads to the removal of Domainslib as a dependency, that'd be a bonus.
…ro idle CPU
Replaces the Domainslib dependency with a custom WorkerPool that uses Domain.spawn for workers that block on Condition.wait between calls.
Key design:
- Workers sleep (zero CPU) when no parallel_for is active
- Nested parallel_for calls detected via atomic busy flag, fall back to sequential (matches Domainslib's effective behavior)
- Exception propagation with backtrace from worker domains
- ManagedPool retained as lazy wrapper with same lifecycle
Benchmark (4-core CI runner):
- Eval unit tests: 2762ms (Domainslib) → 2809ms (WorkerPool) [+1.7%, noise]
- JVM compilation: 1062ms (Domainslib) → 1070ms (WorkerPool) [+0.8%, noise]
Removes domainslib from src/dune and haxe.opam.
Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
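The design points in this commit message can be illustrated with a minimal sketch. This is not the code in src/context/parallel.ml; the type names, field names, and the generation-counter scheme below are assumptions chosen for illustration. It only shows one way to combine Domain.spawn, Condition.wait, an atomic busy flag for nested calls, and exception propagation with a backtrace, and it requires OCaml 5.0 or later.

```ocaml
(* Illustrative sketch only; every name here is hypothetical. *)
type job = {
  body : int -> unit;
  next : int Atomic.t;  (* next index to hand out *)
  finish : int;         (* last index, inclusive *)
}

type pool = {
  mutex : Mutex.t;
  work_ready : Condition.t;  (* signalled when a job is published *)
  work_done : Condition.t;   (* signalled when the last worker finishes *)
  busy : bool Atomic.t;      (* true while a parallel_for is running *)
  num_workers : int;
  mutable job : job option;
  mutable generation : int;  (* bumped once per published job *)
  mutable pending : int;     (* workers that have not finished the job yet *)
  mutable stopping : bool;
  mutable error : (exn * Printexc.raw_backtrace) option;
  mutable domains : unit Domain.t list;
}

(* Pull indices from the shared counter until the range is exhausted.
   The first exception is recorded and stops further hand-outs. *)
let run_job pool job =
  let rec loop () =
    let i = Atomic.fetch_and_add job.next 1 in
    if i <= job.finish then begin job.body i; loop () end
  in
  try loop () with exn ->
    let bt = Printexc.get_raw_backtrace () in
    Atomic.set job.next (job.finish + 1);
    Mutex.lock pool.mutex;
    (match pool.error with
     | None -> pool.error <- Some (exn, bt)
     | Some _ -> ());
    Mutex.unlock pool.mutex

(* Workers block on Condition.wait (no spinning) until a new generation. *)
let rec worker_loop pool seen =
  Mutex.lock pool.mutex;
  while (not pool.stopping) && pool.generation = seen do
    Condition.wait pool.work_ready pool.mutex
  done;
  if pool.stopping then Mutex.unlock pool.mutex
  else begin
    let gen = pool.generation in
    let job = Option.get pool.job in
    Mutex.unlock pool.mutex;
    run_job pool job;
    Mutex.lock pool.mutex;
    pool.pending <- pool.pending - 1;
    if pool.pending = 0 then Condition.signal pool.work_done;
    Mutex.unlock pool.mutex;
    worker_loop pool gen
  end

let create num_workers =
  let pool = {
    mutex = Mutex.create (); work_ready = Condition.create ();
    work_done = Condition.create (); busy = Atomic.make false;
    num_workers; job = None; generation = 0; pending = 0;
    stopping = false; error = None; domains = [];
  } in
  pool.domains <-
    List.init num_workers (fun _ -> Domain.spawn (fun () -> worker_loop pool 0));
  pool

let parallel_for pool ~start ~finish body =
  if finish < start then ()
  else if not (Atomic.compare_and_set pool.busy false true) then
    (* Nested (or concurrent) call: fall back to a sequential loop. *)
    for i = start to finish do body i done
  else begin
    let job = { body; next = Atomic.make start; finish } in
    Mutex.lock pool.mutex;
    pool.job <- Some job;
    pool.error <- None;
    pool.pending <- pool.num_workers;
    pool.generation <- pool.generation + 1;
    Condition.broadcast pool.work_ready;
    Mutex.unlock pool.mutex;
    run_job pool job;  (* the calling domain participates too *)
    Mutex.lock pool.mutex;
    while pool.pending > 0 do Condition.wait pool.work_done pool.mutex done;
    let error = pool.error in
    pool.job <- None;
    Mutex.unlock pool.mutex;
    Atomic.set pool.busy false;
    match error with
    | Some (exn, bt) -> Printexc.raise_with_backtrace exn bt
    | None -> ()
  end

let shutdown pool =
  Mutex.lock pool.mutex;
  pool.stopping <- true;
  Condition.broadcast pool.work_ready;
  Mutex.unlock pool.mutex;
  List.iter Domain.join pool.domains;
  pool.domains <- []
```

In this sketch the caller waits until every worker has acknowledged the current job before returning, which keeps the generation counter simple at the cost of one wake-up per worker per call. A real implementation may well distribute work in chunks rather than single indices.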
…ughput Co-authored-by: Simn <634365+Simn@users.noreply.github.com>
Implemented in d9dd515. Design: Benchmarks (4-core CI, median):
Within noise. The throughput is equivalent; the benefit is zero idle CPU (workers sleep instead of spin-waiting).
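For reference, the percentages quoted for these benchmarks follow from simple arithmetic on the reported timings. The helper names below are hypothetical, and the median function is only a sketch of how repeated runs might be summarised; the input numbers are the ones reported in this PR.

```ocaml
(* Relative change of [after] with respect to [before], in percent. *)
let percent_change ~before ~after = (after -. before) /. before *. 100.0

(* Speed-up factor: how many times faster [after] is than [before]. *)
let speedup ~before ~after = before /. after

(* Median of a non-empty list of samples. *)
let median samples =
  let a = Array.of_list samples in
  Array.sort Float.compare a;
  let n = Array.length a in
  if n = 0 then invalid_arg "median: empty sample list"
  else if n mod 2 = 1 then a.(n / 2)
  else (a.((n / 2) - 1) +. a.(n / 2)) /. 2.0

let () =
  (* Timings in milliseconds, as reported for Domainslib vs WorkerPool. *)
  Printf.printf "Eval unit tests: %+.1f%%\n"
    (percent_change ~before:2762.0 ~after:2809.0);
  Printf.printf "JVM compilation: %+.1f%%\n"
    (percent_change ~before:1062.0 ~after:1070.0);
  (* hxblib get bytes, in seconds, Deflated vs Stored. *)
  Printf.printf "hxblib get bytes: %.1fx faster\n"
    (speedup ~before:0.090 ~after:0.021)
```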
Profiling setup and analysis for the Haxe compiler eval path and HXB roundtrip, identifying GC pressure, type-traversal hotspots, and zip I/O as the primary cost centres. Includes a concrete performance fix for HXB zip I/O, a replacement of Domainslib with a custom WorkerPool, a comprehensive profiling guide, and a corrected analysis of polymorphic comparison sites verified via OCaml assembly inspection.
Infrastructure changes
src/dune: adds a profile build environment with (ocamlopt_flags (:standard -O2 -g -unbox-closures)) so a DWARF-instrumented binary can be produced without touching the release build. The resulting binary works with perf record --call-graph dwarf and resolves OCaml symbols correctly.
Code changes
src/compiler/generate.ml: switched HXB zip compression from level 6 (Deflated) to level 0 (Stored). This eliminates all zlib compression/decompression overhead:
- hxblib get bytes: 0.090 s → 0.021 s (4.3× faster)
src/context/parallel.ml: replaced Domainslib with a custom WorkerPool using Domain.spawn + Condition.wait. Workers sleep (zero CPU) between parallel_for calls instead of spin-waiting. Key design points:
- Workers block on Condition.wait when idle
- Nested parallel_for calls (e.g. analyzer iterating types → fields) are detected via an atomic busy flag and fall back to sequential, preventing state corruption
- ManagedPool retained as a thin lazy wrapper matching the old lifecycle
src/compiler/server/serverCompilationContext.ml: updated ManagedPool.create to use the new argument-free constructor.
src/dune and haxe.opam: removed domainslib dependency.
Documentation
extra/PROFILING.md: comprehensive profiling guide covering how to use perf, olly gc-stats, and --times flags, with full results, recommendations, and documentation of the WorkerPool architecture.
How to profile
Findings
Eval unit tests (--times + perf + olly, ~2.7 s)
- … do_some_marking + alloc/promote/sweep)
- … olly confirmed)
- … ancestorHasInitializeUtest alone)
- … handle_abstract_casts, fix_return_dynamic_from_void_function)
- compare_val (polymorphic = / compare)
compare_val analysis (3.1 % of perf time), verified via assembly inspection
OCaml's native compiler specializes = for types with all-constant constructors and for comparisons against literal constant constructors. Verified by compiling test files with ocamlfind ocamlopt -S and inspecting the generated assembly.
Truly polymorphic (calls caml_equal):
- typeloadCheck.ml:187: field_kind = field_kind (has Var of var_kind)
- optimizerTexpr.ml:189: tconstant = tconstant (has TInt of int32, TString of string, etc.)
- fields.ml:141,194: e.eexpr = TConst TSuper (structured literal, but shallow comparison)
- … string list * string
- nullSafety.ml: polymorphic Hashtbl with safety_subject keys (also contributes to caml_hash 2.4%)
NOT polymorphic (specialized by OCaml, corrected from initial analysis):
- typeloadCheck.ml:66: m1 = MethDynamic: method_kind all-constant constructors → direct cmpq
- typeloadCheck.ml:432: mkind m1 = mkind m2: projects to int → direct cmpq
- fields.ml:41: !(a.a_status) = Const: literal constant constructor → direct cmpq
HXB roundtrip
- HxbWriter.write_type_instance and Zlib.update_crc visible in perf.
- With Stored compression, hxblib get bytes dropped from 0.090 s to 0.021 s.
GC overhead (olly gc-stats)
WorkerPool benchmark (4-core CI runner)
Throughput is within measurement noise. The primary benefit is eliminating idle CPU consumption: Domainslib workers spin-wait even when no work is available, whereas WorkerPool workers block on Condition.wait and consume zero CPU between parallel sections.
Remaining improvement opportunities
compare_val: replace polymorphic (=) / compare with typed equality in the confirmed polymorphic sites (field_kind, tconstant, path comparisons). Switch nullSafety.ml to functorized hash tables with custom hash/equal for safety_subject.
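As a rough illustration of what typed equality and a functorized hash table could look like, here is a small sketch. The type definitions are simplified, hypothetical stand-ins and do not reproduce the compiler's real field_kind or safety_subject; the hash function is likewise just one plausible choice. String.hash assumes OCaml 5.0 or later.

```ocaml
(* Hypothetical, simplified stand-ins for the compiler's types. *)
type var_access = AccNormal | AccNo | AccNever
type method_kind = MethNormal | MethInline | MethDynamic | MethMacro
type var_kind = { v_read : var_access; v_write : var_access }
type field_kind = Var of var_kind | Method of method_kind

(* All-constant constructors: with the argument types annotated, (=) is
   expected to be specialized instead of calling caml_equal. *)
let equal_var_access (a : var_access) (b : var_access) = a = b
let equal_method_kind (a : method_kind) (b : method_kind) = a = b

let equal_var_kind a b =
  equal_var_access a.v_read b.v_read && equal_var_access a.v_write b.v_write

(* Structural cases are matched explicitly, so no polymorphic comparison
   is needed for the constructors that carry arguments. *)
let equal_field_kind a b =
  match a, b with
  | Var v1, Var v2 -> equal_var_kind v1 v2
  | Method m1, Method m2 -> equal_method_kind m1 m2
  | Var _, Method _ | Method _, Var _ -> false

(* Hypothetical key type standing in for safety_subject. *)
type subject = SLocal of int | SField of int * string

(* Hashtbl.Make takes the key's own equal and hash functions, so lookups
   avoid the generic structural hash and comparison. *)
module SubjectTbl = Hashtbl.Make (struct
  type t = subject

  let equal a b =
    match a, b with
    | SLocal x, SLocal y -> Int.equal x y
    | SField (x, f), SField (y, g) -> Int.equal x y && String.equal f g
    | SLocal _, SField _ | SField _, SLocal _ -> false

  let hash = function
    | SLocal id -> id land max_int
    | SField (id, name) -> ((id * 31) + String.hash name) land max_int
end)

let () =
  let tbl = SubjectTbl.create 16 in
  SubjectTbl.replace tbl (SField (1, "x")) "checked";
  match SubjectTbl.find_opt tbl (SField (1, "x")) with
  | Some status -> print_endline status
  | None -> print_endline "missing"
```

Whether each typed comparison really avoids caml_equal can be checked the same way as in the analysis above, by inspecting the output of ocamlfind ocamlopt -S.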